(54) Method and apparatus for testing links between network switches



(57) A test link protocol continuously monitors
each link in a network to ensure that the link is correctly
transmitting data. Each switch, or torus, has at least one
of two functional components: Send Test and Receive
Test. The Send Test component monitors control codes
at a torus link output. The Receive Test component
monitors control codes at a torus link input.

After a predetermined interval, the Send Test component
makes a request to send a test_link control code.
The torus sends the test_link code to the neighbouring
torus, where it is removed from the data stream and sent
to that torus' Receive Test. The Receive Test then
generates a response message and makes a request to
send that message back to the originating torus. After
receiving the message, the Send Test analyzes the
message to determine whether the network link is working
correctly. An error is also declared if the Send Test
does not receive a reply within a predetermined interval.
Description

This invention pertains in general to data transmit- 
ting networks and more particularly to an error detection 
protocol for detecting errors between network switches.

Large scale data processing networks typically 
comprise many different switches separated by commu- 
nication links. Each switch is linked to one or more other 
switches. These switches constantly receive and trans- 
mit data. Data transmission protocols are used to en- 
sure that data passes through the network without error. 

Data transmission protocols typically rely on control 
codes to ensure that all switches and links are function- 
ing correctly. For example, a transmitting switch will 
send a code indicating that data has been transmitted 
and a receiving switch will send a reply acknowledging 
that data was received. The transmitting switch will ex- 
pect to receive the acknowledgement within a pre-de- 
termined time period. Otherwise, a timeout error occurs. 

Often, a timeout error is the first indication of network
failure. Such an error, however, does not indicate
what the error was or isolate the part of the network that 
failed. Thus, it is difficult for a switch or network admin- 
istrator to determine which part of the network has failed. 
Moreover, switches waste valuable time waiting for a 
never-arriving acknowledgement code. 

Therefore, it is an object of the present invention to 
provide a method and apparatus for determining when 
a network error has occurred, where a network error has 
occurred and to provide a method and apparatus for iso- 
lating the location of a network error. 

This invention provides a network having an error 
detection protocol, said network comprising: a first 
switch; a second switch linked to said first switch; a send 
test component associated with said first switch for 
sending messages to said second switch and compar- 
ing messages received from said second switch with 
said send test component to detect network errors; and 
a receive test component associated with said second 
switch for responding to messages received from said 
send test component. 

In at least a preferred embodiment, a test link pro- 
tocol continuously monitors each link in the network to 
ensure that the link is correctly following a higher level 
protocol. Each torus, or switch, is connected to at least
one other torus by a pair of uni-directional links, one 
sending link and one receiving link. Each sending link is 
connected to the receiving link of another torus and vice- 
versa. Each torus has at least one of two functional com- 
ponents: Send Test and Receive Test. The Send Test 
component monitors control codes at the torus link out- 
put. The Receive Test component monitors control 
codes at the torus link input. The test link protocol is
implemented for each pair of links in the network and tests
each pair separately.

After a predetermined interval, the Send Test component
automatically makes a request to send a
test_link control code. The torus sends the test_link
code to the neighbouring torus, where it is removed from
the data stream and sent to that torus' Receive Test. The 
Receive Test then generates a response message and 
makes a request to send that message back to the orig- 
inating torus. After receiving the message, the Send 
Test analyzes the message to determine whether the 
network link is working correctly. If an error is detected, 
the Send Test issues an error message and shuts down 
the network link that the message used. An error mes- 
sage is also sent if the Send Test does not receive a 
reply within a predetermined interval. 

For a more complete understanding of the present 
invention, and the advantages thereof, reference is now 
made to the following descriptions of an embodiment 
thereof taken in conjunction with the accompanying 
drawings, in which: 

Figure 1 depicts a typical configuration of a torus- 
switch link interconnection; 

Figure 2 depicts a detailed view of a pair of input/ 
output links connecting two tori; 

Figure 3 depicts a sample state machine for track- 
ing control codes; 

Figure 4 depicts a detailed view of the logical blocks
in the Send Test component; 

Figure 5 depicts a state machine used to determine 
the maximum count of the Send Test counter; 

Figure 6 depicts a state machine used to control the 
behaviour of the Send Test component; and 

Figure 7 depicts a detailed view of the logical blocks
in the Receive Test component. 

The present invention can be implemented in any 
network in which multiple switches are connected by 
logical data channels. For example, the invention can 
be implemented in a large network having thousands of
switches and connections or in a small network connect- 
ing computer peripherals to the system bus of a compu- 
ter system. 

Figure 1 shows a typical configuration of interconnected
tori. A "torus" is a cross-point switch used to
transmit information across a network. Two tori 110, 112
are shown. Torus 110 has an Output Link 114 connected
to torus 112's Input Link 116 via physical link 118.
Likewise, torus 112's Output Link 120 is connected to
torus 110's Input Link 122 via physical link 124.

A pair of interconnected Input/Output Links 114,
116, 120, 122 forms a logical data channel. The links
themselves 118, 124 are herein referred to as "physical
links," but may in fact be based upon any form of data
transmission. Thus, the Input/Output Links 114, 116,
120, 122 need not be physically connected. Each torus
110, 112 sends messages to the other torus via its Output
Link 114, 120. The messages follow a specific protocol.
The protocol, in turn, is composed of control
codes. The control codes provide information such as 
the size of the message or what action the receiving 
torus is to perform on the message. 

Figure 2 shows a detailed view of a pair of Input/Output
links connecting the tori 110, 112 of Figure 1. Also
shown are the Send 210 and Receive 212 Test functional
components. These two components 210, 212 are
present for every pair of links 118, 124 between tori. The
Send 210 and Receive 212 Test functional components
implement the test link protocol between the tori.

Each Output Link 114, 120 sends data from its respective
torus 110, 112 over a physical link 118, 124 to
the Input Link 116, 122 of the other torus 112, 110. Out- 
put Links 114, 120 also process and act on requests 
from the Send 210 and Receive 212 Test components 
to send messages to the other torus. 

Input Links 116, 122 receive data from their respec- 
tive physical links 118, 124. In addition, the Input Links 
116, 122 detect certain control codes, remove those 
codes from the data streams, and then notify either 
Send 210 or Receive 212 Test of the codes' arrival. For
example, Input Link 116 detects and removes test_link
control codes from the data stream and notifies Receive
Test 212 of their arrival. Further operation of the Input
116, 122 and Output Links 114, 120 is described below.

Send Test 210 is connected to torus 110 such that
it can monitor messages sent over physical link 118,
send messages to Output Link 114, and receive messages
from Input Link 122. Send Test 210 has three main
functions: monitor control codes; determine when to
send a test_link message to Receive Test 212; and compare
the state of Receive Test 212 to its own state.

Send Test 210 monitors control codes sent by Output
Link 114 over physical link 118. Send Test 210 uses
a state machine to track these codes. The actual configuration
of the state machine is dependent upon the
protocol used by the network. Since the state machine
will be used for error detecting, however, both Send Test
210 and Receive Test 212 must use identical state machines.

Figure 3 shows a sample state machine for tracking 
control codes. This state machine has three possible 
states: "idle" 310; "control code one" 312; and "control 
code two" 314. The state machine transitions to each 
state when it detects the corresponding control code. 
For example, the state changes from "idle" 310 to "con- 
trol code one" 312 when Output Link 114 sends a 
control_code_1 over physical link 118. 
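
The three-state machine of Figure 3 can be sketched as follows. This is a minimal illustration, not the patented circuit: the class name ControlCodeMonitor, the code name control_code_2 and the rule that unrecognised codes leave the state unchanged are assumptions made for the example. The text does not describe how the machine returns to "idle", so that transition is omitted.

```python
from enum import Enum


class State(Enum):
    IDLE = "idle"
    CONTROL_CODE_ONE = "control code one"
    CONTROL_CODE_TWO = "control code two"


# Hypothetical mapping from a detected control code to the state it selects.
# Only control_code_1 is named in the text; control_code_2 is assumed.
TRANSITIONS = {
    "control_code_1": State.CONTROL_CODE_ONE,
    "control_code_2": State.CONTROL_CODE_TWO,
}


class ControlCodeMonitor:
    """Tracks the state selected by the most recent recognised control code."""

    def __init__(self):
        self.state = State.IDLE

    def observe(self, code):
        # Assumption: codes that are not in the table leave the state as it is.
        self.state = TRANSITIONS.get(code, self.state)
        return self.state


monitor = ControlCodeMonitor()
monitor.observe("control_code_1")
print(monitor.state.value)
```

Because Send Test and Receive Test must use identical state machines, both sides would instantiate the same class in a model like this.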

Send Test 210 determines when to send a test_link
message to Receive Test 212. The messages can be
sent after any predetermined time interval, but the best
mode is described herein. If no message traffic is on
physical link 118, Send Test 210 requests that Output
Link 114 send a test_link every 65,536 clock cycles. If
a message is on physical link 118, Send Test 210 requests
a test_link 64 cycles after the beginning of the
message. Output Link 114 sends the test_link message
to Receive Test 212 as soon as traffic permits.

Receive Test 212 has two main purposes: monitor
control codes carried by physical link 118 and send the
current status of Receive Test 212's state machine to
Send Test 210 in response to a test_link message.

Receive Test 212 is connected to physical link 118
such that it can monitor control codes sent on the link.
Receive Test implements a state machine like that
shown in Figure 3 and updates it according to the control
codes it detects.

When Input Link 116 receives a test_link message,
it notifies Receive Test 212 of the message. In response,
Receive Test 212 formulates a rcv_test_response message
containing the current state of the Receive Test
212 state machine. Next, Receive Test 212 requests
that Output Link 120 send the rcv_test_response message.
Output Link 120 sends the rcv_test_response
message to Input Link 122 via physical link 124.

Send Test 210 then compares the state of Receive
Test 212 with its own. In normal operation, both test
components 210, 212 should be in the same state because
they implement the same state machine. If the
components 210, 212 are in different states, a network
error has occurred. In addition, an error occurs if a
rcv_test_response message is not received by Send
Test 210 within a predetermined time interval.
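
The exchange described above can be modelled end to end in a few lines. The sketch below is illustrative only: the Monitor class is a simplified stand-in for the Figure 3 state machine (it just remembers the last code seen), and run_link_test and dropped_indices are hypothetical names used to simulate a link that loses control codes. The timeout case is not modelled here.

```python
class Monitor:
    """Simplified stand-in for the control code state machine (assumption)."""

    def __init__(self):
        self.state = "idle"

    def observe(self, code):
        self.state = code


def run_link_test(codes, dropped_indices=()):
    """Send codes over a simulated link, then perform one test_link exchange."""
    send_side = Monitor()      # Send Test watches the link output
    receive_side = Monitor()   # Receive Test watches the link input
    for index, code in enumerate(codes):
        send_side.observe(code)
        if index not in dropped_indices:   # a faulty link loses some codes
            receive_side.observe(code)
    # On test_link, Receive Test replies with its current state.
    rcv_test_response = receive_side.state
    # Send Test compares the reported state with its own.
    return "ok" if rcv_test_response == send_side.state else "network error"


print(run_link_test(["control_code_1", "control_code_2"]))
print(run_link_test(["control_code_1", "control_code_2"], dropped_indices={1}))
```

In this simplified model a lost code is only noticed if the two sides end up in different states at the moment of the test, which is why the protocol tests the link frequently.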

Send Test

Figure 4 depicts a detailed view of the logical blocks
in Send Test 210. A decoder 410 and a counter 412 are
connected to physical link 118. Decoder 410 is connected
to a Control Code Monitoring State Machine
("CCMSM") 414, which, in turn, is connected to a comparator
416. Counter 412 is connected to a Send Test Link State
Machine ("STLSM") 418. A system clock, which generates
clock cycles, is also present but not shown.

Decoder 410 monitors and decodes control codes
sent over physical link 118. The control codes are then
sent to CCMSM 414. CCMSM 414 implements a state
machine like that shown in Figure 3.

Counter 412 counts clock cycles. The limit of Counter
412 is determined using the state machine shown in
Figure 5. If a message has been sent on physical link
118, Counter 412's maximum count is 64 cycles from
the beginning of the message. If STLSM 418 has requested
that a test_link control code be sent, Counter
412's maximum count is 32. Otherwise, Counter 412's
maximum is 65,536. When Counter 412 reaches its
maximum count, it sends a signal to STLSM 418.
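
The selection of the counter limit can be written as a small function. This is a sketch under stated assumptions: the function and constant names are hypothetical, and the text does not say which rule wins when a message is in progress and a test_link has also been requested, so the response timeout is given priority here.

```python
IDLE_LIMIT = 65_536      # no message traffic on the link
MESSAGE_LIMIT = 64       # cycles counted from the beginning of a message
RESPONSE_LIMIT = 32      # cycles allowed for the response to a test_link


def counter_limit(message_started, test_link_requested):
    """Return the maximum count for the Send Test counter."""
    # Assumption: waiting for a response takes priority over the other cases.
    if test_link_requested:
        return RESPONSE_LIMIT
    if message_started:
        return MESSAGE_LIMIT
    return IDLE_LIMIT


print(counter_limit(message_started=False, test_link_requested=False))
```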

STLSM 418 determines when to request that a
test_link message be sent and whether a network error
has occurred. STLSM 418 implements the state machine
shown in Figure 6. The state machine has four
states: "increase counter" 610; "request test_link" 612;
"test response" 614; and "compare states" 614.

STLSM 418's state machine remains at "increase
counter" state 610 until it receives a counter_at_limit
signal from counter 412. Then, the state machine transitions
to "request test_link" state 612. At state 612,
STLSM 418 sends a signal to Output Link 114 requesting
that a test_link message be sent over physical link
118. Also at state 612, STLSM 418 sets Counter 412 to
32. Then, STLSM 418 transitions to "test response"
state 614.

At state 614, the state machine waits for either a
rcv_test_response signal from Input Link 122 or a
counter_at_limit signal from counter 412. If a
rcv_test_response signal is received, the state machine
transitions to "compare states" state 614 and then back
to "increase counter" state 610. A counter_at_limit signal
received while at state 614 indicates that 32 cycles
have elapsed since the test_link message was sent.
Therefore, an error has occurred because a response
from torus 112 was not received within 32 cycles.
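
The four-state behaviour of STLSM 418 can be sketched as a transition function. The names below (Stlsm, step and the action strings) are hypothetical. Two details are assumptions because the text does not specify them: the machine returns to "increase counter" after a timeout error, and a response that arrives in the same cycle as counter_at_limit is treated as received in time.

```python
from enum import Enum, auto


class Stlsm(Enum):
    INCREASE_COUNTER = auto()
    REQUEST_TEST_LINK = auto()
    TEST_RESPONSE = auto()
    COMPARE_STATES = auto()


def step(state, counter_at_limit=False, rcv_test_response=False):
    """Advance one cycle and return (next_state, action or None)."""
    if state is Stlsm.INCREASE_COUNTER:
        if counter_at_limit:
            return Stlsm.REQUEST_TEST_LINK, None
        return state, None
    if state is Stlsm.REQUEST_TEST_LINK:
        # Ask the output link for a test_link and arm the 32-cycle timeout.
        return Stlsm.TEST_RESPONSE, "request test_link, set counter to 32"
    if state is Stlsm.TEST_RESPONSE:
        if rcv_test_response:          # assumption: response wins a tie
            return Stlsm.COMPARE_STATES, None
        if counter_at_limit:
            # Assumption: restart the cycle after reporting the timeout.
            return Stlsm.INCREASE_COUNTER, "error: no response within 32 cycles"
        return state, None
    # COMPARE_STATES: the comparator result is valid in this state.
    return Stlsm.INCREASE_COUNTER, "compare states"


state, action = step(Stlsm.TEST_RESPONSE, counter_at_limit=True)
print(state.name, action)
```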

Comparator 416 compares the state of Receive
Test 212 with that of Send Test 210. Comparator 416
receives signals from Input Link 122 and from CCMSM
414. The state of Receive Test 212 is embedded in the
rcv_test_response signal received from Input Link 122.
Using these two signals, comparator 416 compares the
state of Receive Test 212 with the state of CCMSM 414.
If the states are different and STLSM 418 is in the "compare
states" state 614, then a control code has been lost
and a network error has occurred.

Receive Test 

Figure 7 depicts a detailed view of the logical blocks
in Receive Test 212. A decoder 710 is connected to
physical link 118. Decoder 710 is also connected to a
CCMSM 712. Input Link 116 is connected to a Control
Code Test Link Decoder ("CCTLD") 714. CCTLD 714,
in turn, is connected to a latch 716. Both CCMSM 712
and latch 716 have outputs 718, 720 connected to Output
Link 120 (not shown in Figure 7). A system clock,
which generates clock cycles, is also present but not
shown.

Decoder 710 monitors and decodes control codes 
sent over physical link 118. The decoded codes are then 
sent to CCMSM 712. CCMSM 712 implements a state
machine like that of Figure 3. In addition, CCMSM 712 
has an output 718 indicating its current state. This output 
718 is connected to Output Link 120. 

CCTLD 714 decodes test_link signals received by
Input Link 116. A test_link signal lasts for only one cycle
and, thus, so does CCTLD 714's output signal. Therefore,
CCTLD 714 sends its output to latch 716. Latch
716's output creates a one cycle pulse which requests
that Output Link 120 send a rcv_test_response message
over physical link 124. When traffic permits, Output
Link 120 sends a rcv_test_response message containing
the state of control code monitoring state machine
712.
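
The Receive Test behaviour can be approximated in software as shown below. This is a loose sketch rather than a model of the hardware: the ReceiveTest class, its method names and the dictionary message format are hypothetical, the CCMSM is reduced to remembering the last code seen, and the latch is modelled as a flag that holds the request until the output link is free. Reporting the state at the time of sending, rather than at the time the test_link arrived, is also an assumption.

```python
class ReceiveTest:
    """Simplified software model of the Receive Test component."""

    def __init__(self):
        self.ccmsm_state = "idle"       # stand-in for CCMSM 712
        self.response_pending = False   # stand-in for latch 716

    def on_control_code(self, code):
        # Update the monitored state for each decoded control code.
        self.ccmsm_state = code

    def on_test_link(self):
        # The one-cycle test_link signal is captured so it is not missed.
        self.response_pending = True

    def poll(self, link_busy):
        """Called once per cycle; returns a response message or None."""
        if self.response_pending and not link_busy:
            self.response_pending = False
            return {"type": "rcv_test_response", "state": self.ccmsm_state}
        return None


receiver = ReceiveTest()
receiver.on_control_code("control_code_1")
receiver.on_test_link()
print(receiver.poll(link_busy=True))
print(receiver.poll(link_busy=False))
```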



The rcv_test_response message is received by Input
Link 122. Input Link 122 detects and removes the
message from the data stream and passes its contents
to Send Test 210. As discussed above, Send Test 210
then uses the message to compare the states of Send
Test 210 and Receive Test 212.

When Send Test 210 determines that a network error
has occurred, it will normally shut down the logical
communication channel on which the error occurred or 
perform another predetermined action such as reset the 
logical channel or notify a network supervisor. Of 
course, other channels can still be used to send mes- 
sages through the network. 

There has been described a test link protocol which 
continuously monitors each link in a network to ensure 
that the link is correctly transmitting data. Each switch,
or torus, has at least one of two functional components:
Send Test and Receive Test. The Send Test component 
monitors control codes at a torus link output. The Re- 
ceive Test component monitors control codes at a torus 
link input. 

After a predetermined interval, the Send Test component
makes a request to send a test_link control code.
The torus sends the test_link code to the neighbouring
torus, where it is removed from the data stream and sent 
to that torus' Receive Test. The Receive Test then gen- 
erates a response message and makes a request to 
send that message back to the originating torus. After 
receiving the message, the Send Test analyzes the 
message to determine whether the network link is work- 
ing correctly. An error is also declared if the Send Test 
does not receive a reply within a predetermined interval. 

Although the present invention and its advantages 
have been described in detail, it should be understood 
that various changes, substitutions and alterations can 
be made herein without departing from the scope of the 
invention as defined by the appended claims. 



Claims 

1. A network having an error detection protocol, said 
network comprising: 

a first switch;

a second switch linked to said first switch; 

a send test component associated with said 
first switch for sending messages to said sec- 
ond switch and comparing messages received 
from said second switch with said send test
component to detect network errors; and 

a receive test component associated with said 
second switch for responding to messages re- 
ceived from said send test component. 
2. A network as claimed in claim 1, further comprising:

a counter, said counter associated with said 
send test component, for determining when to send 
said messages to said second switch and to deter- 
mine when a reply should be received from said 
second switch. 

3. A network as claimed in claim 1 or claim 2, further 
comprising: 

a first state associated with said send test com- 
ponent; and 

a second state associated with said receive test 
component, wherein said receive test component
sends said second state to said first switch
and said send test component compares said 
second state with said first state to detect said 
network errors. 

4. A network as claimed in any preceding claim, 
wherein said first and second switches are torus 
switches. 

5. A network as claimed in any preceding claim, fur- 
ther comprising: 

error handling means for controlling said first 
and second switches when said network errors are 
detected. 

6. A network as claimed in claim 5, wherein said error 
handling means further comprises: 

means for shutting down said network when 
said network errors are detected. 

7. A network as claimed in claim 5, wherein said error 
handling means further comprises: 

means for resetting said network when said 
network errors are detected. 

8. A network as claimed in claim 5, wherein said error 
handling means further comprises: 

means for notifying a network supervisor 
when said network errors are detected. 

9. A network as claimed in any preceding claim, further
comprising:

a first input/output link associated with said first
switch, said first input/output link having a first
input link and a first output link; and

a second input/output link associated with said
second switch, said second input/output link
having a second input link and a second output
link, said first output link connected to said second
input link by a uni-directional connection
and said second output link connected to said
first input link by a uni-directional connection.

10. A network as claimed in claim 9, wherein said uni-directional
connections are physical links.

11. A method of error detection in a network having a
first switch and a second switch, comprising the
steps of:

sending a message from said first switch to said
second switch;

determining, in response to said message, a
state of a receive test component associated
with said second switch; and

replying to said message by sending said state
of said receive test component to said first
switch; and

comparing said state of said receive test component
with a state of a send test component
associated with said first switch to detect a network
error.

12. A method as claimed in claim 11, wherein said message
is sent to said second switch after a predetermined
time interval.

13. A method as claimed in claim 11 or claim 12, further
comprising the step of:

detecting said network error if said reply is not
received by said first switch within a predetermined
time interval.

14. A network switch arranged to operate in a network
using a method as claimed in any of claims 11 to 13,